The marketing team is planning next year's marketing strategy and wants to rethink its approach, given the poor results its campaigns have had over the last few years.
The new plan is based on implementing segmentation strategies, something the team has not done before.
Segmentation is one of the main strategies used in marketing and has proved effective, so it is the objective of this project: to provide information about the segments that exist in the customer base, their qualities, and what distinguishes them from one another.
At this initial point of the project, many questions come to mind. Some of them need to be answered by the business, while others can be answered by this data science project. Let's describe them all first, to get a clear picture of the exploration ahead:
As data scientists, our main challenge here is to find clear patterns and groups in the data, identify their qualities and what makes them different from each other, and then interpret those groups as customer segments. Ideally, we will find at least two groups that are as different from each other as possible. This knowledge can give the business and marketing teams clarity about who their clients are and how to improve their strategies to better achieve their business goals.

Data science offers robust techniques and a perspective that can provide the business and marketing teams with trustworthy, data-based information and knowledge, so they can make informed decisions about their challenges. With this project, we can contribute valuable, curated information about the business and actionable knowledge (in this case: customer profiles, their behavior, and the existing segments).

The nature of this problem is unsupervised learning, because we do not know beforehand what the segments of this customer base are. We will benefit from applying clustering techniques and letting the data show us what they are.
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
!pip install scikit-learn
!pip install scikit-learn-extra
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# To compute distances
from scipy.spatial.distance import cdist
# To perform K-means clustering and compute Silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# To visualize the elbow curve and Silhouette scores
from yellowbrick.cluster import SilhouetteVisualizer
# Importing PCA
from sklearn.decomposition import PCA
# To encode the variable
from sklearn.preprocessing import LabelEncoder
# Importing TSNE
from sklearn.manifold import TSNE
# To perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# To compute distances
from scipy.spatial.distance import pdist
# To import K-Medoids
from sklearn_extra.cluster import KMedoids
# To import Gaussian Mixture
from sklearn.mixture import GaussianMixture
# To import DBSCAN
from sklearn.cluster import DBSCAN
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv("marketing_campaign.csv")
data.shape
(2240, 27)
This is a customer database of 2,240 entries and 27 columns.
# Checking if there are duplicated entries.
data[data.duplicated()]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 27 columns
data.head(10)
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 7446 | 1967 | Master | Together | 62513.0 | 0 | 1 | 09-09-2013 | 16 | 520 | ... | 4 | 10 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 965 | 1971 | Graduation | Divorced | 55635.0 | 0 | 1 | 13-11-2012 | 34 | 235 | ... | 3 | 7 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 6177 | 1985 | PhD | Married | 33454.0 | 1 | 0 | 08-05-2013 | 32 | 76 | ... | 0 | 4 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 4855 | 1974 | PhD | Together | 30351.0 | 1 | 0 | 06-06-2013 | 19 | 14 | ... | 0 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 5899 | 1950 | PhD | Together | 5648.0 | 1 | 1 | 13-03-2014 | 68 | 28 | ... | 0 | 0 | 20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
10 rows × 27 columns
data.tail(10)
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2230 | 7004 | 1984 | Graduation | Single | 11012.0 | 1 | 0 | 16-03-2013 | 82 | 24 | ... | 1 | 2 | 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2231 | 9817 | 1970 | Master | Single | 44802.0 | 0 | 0 | 21-08-2012 | 71 | 853 | ... | 4 | 12 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2232 | 8080 | 1986 | Graduation | Single | 26816.0 | 0 | 0 | 17-08-2012 | 50 | 5 | ... | 0 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2233 | 9432 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2234 | 8372 | 1974 | Graduation | Married | 34421.0 | 1 | 0 | 01-07-2013 | 81 | 3 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2235 | 10870 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 13-06-2013 | 46 | 709 | ... | 3 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2236 | 4001 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 10-06-2014 | 56 | 406 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2237 | 7270 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 25-01-2014 | 91 | 908 | ... | 3 | 13 | 6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2238 | 8235 | 1956 | Master | Together | 69245.0 | 0 | 1 | 24-01-2014 | 8 | 428 | ... | 5 | 10 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2239 | 9405 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 15-10-2012 | 40 | 84 | ... | 1 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
10 rows × 27 columns
Here we get a first view of the names of the columns and the type of data collected.
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 472.6+ KB
# Finding the percentage of missing values in each column of the data
(data.isnull().sum()/data.shape[0]*100)
ID 0.000000 Year_Birth 0.000000 Education 0.000000 Marital_Status 0.000000 Income 1.071429 Kidhome 0.000000 Teenhome 0.000000 Dt_Customer 0.000000 Recency 0.000000 MntWines 0.000000 MntFruits 0.000000 MntMeatProducts 0.000000 MntFishProducts 0.000000 MntSweetProducts 0.000000 MntGoldProds 0.000000 NumDealsPurchases 0.000000 NumWebPurchases 0.000000 NumCatalogPurchases 0.000000 NumStorePurchases 0.000000 NumWebVisitsMonth 0.000000 AcceptedCmp3 0.000000 AcceptedCmp4 0.000000 AcceptedCmp5 0.000000 AcceptedCmp1 0.000000 AcceptedCmp2 0.000000 Complain 0.000000 Response 0.000000 dtype: float64
The only column that has missing values is Income. I will treat it later on in this project.
We can also observe that ID has no null values and that its number of unique values equals the number of observations. So ID looks like an index for the data entries; such a column would not add any useful information to our analysis. Hence, it can be dropped.
Dropping the ID column:
data1 = data.copy()
data.drop("ID", axis = 1, inplace= True)
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2240.0 | 1968.805804 | 11.984069 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.251354 | 25173.076661 | 1730.0 | 35303.00 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2240.0 | 0.506250 | 0.544538 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2240.0 | 166.950000 | 225.715373 | 0.0 | 16.00 | 67.0 | 232.00 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.00 | 8.0 | 33.00 | 263.0 |
| MntGoldProds | 2240.0 | 44.021875 | 52.167439 | 0.0 | 9.00 | 24.0 | 56.00 | 362.0 |
| NumDealsPurchases | 2240.0 | 2.325000 | 1.932238 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2240.0 | 4.084821 | 2.778714 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2240.0 | 2.662054 | 2.923101 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2240.0 | 5.790179 | 3.250958 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2240.0 | 5.316518 | 2.426645 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2240.0 | 0.074554 | 0.262728 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2240.0 | 0.072768 | 0.259813 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2240.0 | 0.064286 | 0.245316 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2240.0 | 0.012946 | 0.113069 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2240.0 | 0.009375 | 0.096391 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2240.0 | 0.149107 | 0.356274 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
This is the first exploration of the statistics of the numerical data.
cols = ["Education", "Marital_Status", "Kidhome", "Teenhome"]
Proportion of observations in each category:
for column in cols:
print("Unique values in", column, "are :")
print(data[column].value_counts(normalize = True))
print("*" * 50)
Unique values in Education are : Graduation 0.503125 PhD 0.216964 Master 0.165179 2n Cycle 0.090625 Basic 0.024107 Name: Education, dtype: float64 ************************************************** Unique values in Marital_Status are : Married 0.385714 Together 0.258929 Single 0.214286 Divorced 0.103571 Widow 0.034375 Alone 0.001339 Absurd 0.000893 YOLO 0.000893 Name: Marital_Status, dtype: float64 ************************************************** Unique values in Kidhome are : 0 0.577232 1 0.401339 2 0.021429 Name: Kidhome, dtype: float64 ************************************************** Unique values in Teenhome are : 0 0.516964 1 0.459821 2 0.023214 Name: Teenhome, dtype: float64 **************************************************
Based on this first exploration, let's consolidate some of the sparse categories:
# Replace the category "2n Cycle" with the category "Master"
data["Education"].replace("2n Cycle", "Master", inplace=True)
# Replace the category "Alone" with "Single"
data["Marital_Status"].replace("Alone", "Single", inplace=True)
# Replace the categories "Absurd" and "YOLO" with "Others"
data["Marital_Status"].replace(["Absurd", "YOLO"], "Others", inplace=True)
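As a side note, these consolidations could also be expressed as a single dict-based `replace()` per column. A minimal sketch on a toy frame (not the project data):

```python
import pandas as pd

# Toy stand-in for the real data frame
df = pd.DataFrame({
    "Education": ["Graduation", "2n Cycle", "PhD"],
    "Marital_Status": ["Alone", "YOLO", "Married"],
})

# One dict handles all the category consolidations for a column at once
df["Education"] = df["Education"].replace({"2n Cycle": "Master"})
df["Marital_Status"] = df["Marital_Status"].replace(
    {"Alone": "Single", "Absurd": "Others", "YOLO": "Others"}
)
print(df["Marital_Status"].tolist())  # ['Single', 'Others', 'Married']
```

Keys that don't occur in the column (here, "Absurd") are simply ignored, so one mapping can cover every known messy value.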
Univariate analysis is used to explore each variable in a data set, separately. It looks at the range of values, as well as the central tendency of the values. It can be done for both numerical and categorical variables.
Histograms help to visualize and describe numerical data. We can also use other plots, like box plots, to analyze the numerical columns.
plt.figure(figsize=(10, 7))
sns.histplot(x="Income", data=data)
plt.show()
We can observe some extreme values on the right side of the distribution of the 'Income' feature. Let's use a box plot, as it is more suitable for identifying extreme values in the data.
sns.boxplot(data=data, x="Income", showmeans=True, color="violet")
<AxesSubplot:xlabel='Income'>
There are some outliers in Income. Let's analyze them a bit further to decide whether to keep or drop them.
# Calculating the upper whisker for the Income variable
Q1 = data.quantile(q=0.25) # Finding the first quartile
Q3 = data.quantile(q=0.75) # Finding the third quartile
IQR = Q3 - Q1 # Finding the Inter Quartile Range
upper_whisker = (Q3 + 1.5 * IQR)["Income"] # Calculating the Upper Whisker for the Income variable
print(upper_whisker) # Printing Upper Whisker
118350.5
# Let's check the observations with extreme value for the Income variable
data[data.Income > upper_whisker]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 1973 | PhD | Married | 157243.0 | 0 | 1 | 01-03-2014 | 98 | 20 | 2 | ... | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 617 | 1976 | PhD | Together | 162397.0 | 1 | 1 | 03-06-2013 | 31 | 85 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 655 | 1975 | Graduation | Divorced | 153924.0 | 0 | 0 | 07-02-2014 | 81 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 687 | 1982 | PhD | Married | 160803.0 | 0 | 0 | 04-08-2012 | 21 | 55 | 16 | ... | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1300 | 1971 | Master | Together | 157733.0 | 1 | 0 | 04-06-2013 | 37 | 39 | 1 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1653 | 1977 | Graduation | Together | 157146.0 | 0 | 0 | 29-04-2013 | 13 | 1 | 0 | ... | 28 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2132 | 1949 | PhD | Married | 156924.0 | 0 | 0 | 29-08-2013 | 85 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2233 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | 14 | ... | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 rows × 26 columns
# Checking the 99.5th percentile value for the Income variable
data.quantile(q=0.995)["Income"]
102145.75000000003
These outliers represent only about 0.4% of the customer base, so dropping them will not meaningfully affect the analysis. I will drop them.
# Dropping observations identified as outliers
data.drop(index=[164, 617, 655, 687, 1300, 1653, 2132, 2233], inplace=True)
Now, let's check the distribution of the Income variable after dropping outliers.
plt.figure(figsize=(10, 7))
sns.histplot(x="Income", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="MntWines", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="MntMeatProducts", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="MntGoldProds", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="MntFishProducts", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="MntSweetProducts", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="MntFruits", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="Recency", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="Year_Birth", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="NumStorePurchases", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="NumWebPurchases", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="NumWebVisitsMonth", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="NumCatalogPurchases", data=data)
plt.show()
plt.figure(figsize=(10, 7))
sns.histplot(x="NumDealsPurchases", data=data)
plt.show()
Let's write a function that creates bar plots annotated with the percentage of each category. The function takes the name of a categorical column as input and returns a bar plot for that variable.
def perc_on_bar(z):
    '''
    Plot a count plot for a categorical column, annotating each bar
    with its percentage of the total.
    z: name of the categorical column
    Note: the function won't work if a column is passed in the hue parameter.
    '''
    total = len(data[z])  # Total number of observations in the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(data[z], palette='Paired', order=data[z].value_counts().index)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # Percentage of this category
        x = p.get_x() + p.get_width() / 2 - 0.05  # X position of the annotation (near the bar center)
        y = p.get_y() + p.get_height()  # Y position of the annotation (top of the bar)
        ax.annotate(percentage, (x, y), size=12)  # Annotate the percentage
    plt.show()  # Show the plot
perc_on_bar('Marital_Status')
perc_on_bar('Education')
perc_on_bar('Kidhome')
perc_on_bar('Teenhome')
We have analyzed different categorical and numerical variables. Now, let's check how different variables are related to each other.
A heat map can show the 2D correlation matrix between numerical features.
plt.figure(figsize=(17, 10))
sns.heatmap(data.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap='rocket', linewidths=3)
<AxesSubplot:>
The insights captured from this heat map are very interesting. I chose a threshold of 0.60 as a moderate correlation worth further analysis. Based on this, we can observe the following moderate positive correlations:
The above correlation heatmap only shows the relationship between numerical variables. Let's check the relationship of numerical variables with categorical variables.
print(sns.barplot(x="Education", y="Income", data=data))
AxesSubplot(0.125,0.125;0.775x0.755)
print(sns.barplot(x="Marital_Status", y="Income", data=data))
AxesSubplot(0.125,0.125;0.775x0.755)
print(sns.barplot(x="Kidhome", y="Income", data=data))
AxesSubplot(0.125,0.125;0.775x0.755)
We can also visualize the relationship between two categorical variables.
pd.crosstab(data.Marital_Status, data.Kidhome).plot(kind="bar",stacked=False)
<AxesSubplot:xlabel='Marital_Status'>
In this section, we will first prepare our dataset for analysis.
Think About It:
# Year_Birth already holds the year as an integer, so subtracting it from 2016
# gives the customer's age at the time of data collection in 2016
data["Age"] = 2016 - data["Year_Birth"]
# Sorting the values in ascending order
data["Age"].sort_values()
1170 20
46 20
696 21
747 21
1850 21
...
424 75
1950 76
192 116
339 117
239 123
Name: Age, Length: 2232, dtype: int64
# Finding the outliers: any age above 115
data[data["Age"] > 115]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192 | 1900 | Master | Divorced | 36640.0 | 1 | 0 | 26-09-2013 | 99 | 15 | 6 | ... | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 116 |
| 239 | 1893 | Master | Single | 60182.0 | 0 | 1 | 17-05-2014 | 23 | 8 | 0 | ... | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 123 |
| 339 | 1899 | PhD | Together | 83532.0 | 0 | 0 | 26-09-2013 | 36 | 755 | 144 | ... | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 117 |
3 rows × 27 columns
data.drop([192, 239, 339], axis=0, inplace=True)
data.shape
(2229, 27)
Now, let's check the distribution of age in the data.
plt.figure(figsize=(15, 7))
sns.histplot(x="Age", data=data)
plt.show()
# Adding Kidhome and Teenhome variables to create the new feature called "Kids"
data["Kids"] = data["Kidhome"] + data["Teenhome"]
# Checking the unique categories in Marital_Status
data["Marital_Status"].unique()
array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Others'],
dtype=object)
data["Marital_Status"].value_counts()
Married 861 Together 575 Single 482 Divorced 230 Widow 77 Others 4 Name: Marital_Status, dtype: int64
# Replacing "Married" and "Together" with "Relationship"
data["Marital_Status"].replace(["Married", "Together"], "Relationship", inplace = True)
# Replacing "Divorced" and "Widow" with "Single"
data["Marital_Status"].replace(["Divorced", "Widow"], "Single", inplace = True)
# Encoding marital status as the number of adults in the household
data["Marital_Status"].replace({"Single": 1, "Others": 1, "Relationship": 2}, inplace=True)
data["Marital_Status"].value_counts()
2 1436 1 793 Name: Marital_Status, dtype: int64
# Adding the Marital_Status (number of adults) and Kids variables to get the total number of persons in each family
data["Family_Size"] = data["Marital_Status"] + data["Kids"]
data["Expenses"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]
data["NumTotalPurchases"] = data["NumDealsPurchases"] + data["NumWebPurchases"] + data["NumCatalogPurchases"] + data["NumStorePurchases"]
# Converting the Dt_Customer variable to a pandas datetime object
# (the raw dates are in day-month-year form; passing format="%d-%m-%Y" would
# make the parsing unambiguous)
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"])
# Sorting the values in ascending order
data["Dt_Customer"].sort_values()
2029 2012-01-08
724 2012-01-08
976 2012-01-08
2194 2012-01-08
1473 2012-01-09
...
1952 2014-12-05
216 2014-12-05
288 2014-12-05
153 2014-12-05
2003 2014-12-06
Name: Dt_Customer, Length: 2229, dtype: datetime64[ns]
Let's check the max and min of the date.
# Checking the minimum of the date
min(data.Dt_Customer)
Timestamp('2012-01-08 00:00:00')
# Checking the maximum of the date
max(data.Dt_Customer)
Timestamp('2014-12-06 00:00:00')
Think About It:
# Assigning a fixed reference date of 01-01-2015 to the day variable
data["day"] = "01-01-2015"
# Converting the variable day to Python datetime object
data["day"] = pd.to_datetime(data.day)
data["Engaged_in_days"] = (data["day"] - data["Dt_Customer"]).dt.days
data["TotalAcceptedCmp"] = data["AcceptedCmp1"] + data["AcceptedCmp2"] + data["AcceptedCmp3"] + data["AcceptedCmp4"] + data["AcceptedCmp5"]
data['AmountPerPurchase'] = data['Expenses'] / data['NumTotalPurchases']
Now, let's check the maximum value of the AmountPerPurchase.
max(data.AmountPerPurchase)
inf
# Finding how many observations have NumTotalPurchases equal to 0
data[data["NumTotalPurchases"] == 0]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | day | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 981 | 1965 | Graduation | 1 | 4861.0 | 0 | 0 | 2014-06-22 | 20 | 2 | 1 | ... | 0 | 51 | 0 | 1 | 6 | 0 | 2015-01-01 | 193 | 0 | inf |
| 1524 | 1973 | Graduation | 1 | 3502.0 | 1 | 0 | 2013-04-13 | 56 | 2 | 1 | ... | 0 | 43 | 1 | 2 | 5 | 0 | 2015-01-01 | 628 | 0 | inf |
2 rows × 35 columns
# Dropping the observations with NumTotalPurchases equal to 0
data.drop([981, 1524], axis=0, inplace=True)
max(data.AmountPerPurchase)
1679.0
min(data.AmountPerPurchase)
0.5333333333333333
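As an aside, the inf values could also have been avoided up front by turning zero purchase counts into NaN before dividing. A minimal sketch with toy rows mirroring the cases above (a normal customer and one with zero purchases):

```python
import numpy as np
import pandas as pd

# Toy rows: Expenses and NumTotalPurchases for two customers
df = pd.DataFrame({"Expenses": [1617, 6], "NumTotalPurchases": [25, 0]})

# Replacing 0 with NaN before dividing yields NaN instead of inf
df["AmountPerPurchase"] = df["Expenses"] / df["NumTotalPurchases"].replace(0, np.nan)
print(df["AmountPerPurchase"].tolist())  # [64.68, nan]
```

The NaN rows can then be dropped or imputed in one place, together with the other missing values.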
Now, let's check the distribution of values in AmountPerPurchase column.
data["AmountPerPurchase"].describe().T
count 2227.000000 mean 33.274270 std 45.040897 min 0.533333 25% 9.714286 50% 23.352941 75% 45.281773 max 1679.000000 Name: AmountPerPurchase, dtype: float64
plt.figure(figsize=(10, 7))
sns.histplot(x="AmountPerPurchase", data=data)
plt.show()
# Imputing the missing values for the Income variable with the median
data["Income"].fillna(data.Income.median(), inplace = True)
Now that we are done with data preprocessing, let's visualize the new features against the Income variable, whose missing values have now been imputed.
# Scatter plot with Expenses on the Y-axis and Income on the X-axis
plt.figure(figsize=(10, 7))  # Setting the plot size
sns.scatterplot(data=data, x="Income", y="Expenses")
plt.xticks(fontsize=16)  # Font size of the X-axis tick labels
plt.yticks(fontsize=16)  # Font size of the Y-axis tick labels
plt.xlabel("Income", fontsize=20, labelpad=20)  # Title of the X-axis
plt.ylabel("Expenses", fontsize=20, labelpad=20)  # Title of the Y-axis
Text(0, 0.5, 'Expenses')
print(sns.barplot(x="Family_Size", y="Income", data=data))
AxesSubplot(0.125,0.125;0.775x0.755)
Considering the data exploration and preparation already done, my proposed approach is to follow these steps:
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2227 entries, 0 to 2239 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2227 non-null int64 1 Education 2227 non-null object 2 Marital_Status 2227 non-null int64 3 Income 2227 non-null float64 4 Kidhome 2227 non-null int64 5 Teenhome 2227 non-null int64 6 Dt_Customer 2227 non-null datetime64[ns] 7 Recency 2227 non-null int64 8 MntWines 2227 non-null int64 9 MntFruits 2227 non-null int64 10 MntMeatProducts 2227 non-null int64 11 MntFishProducts 2227 non-null int64 12 MntSweetProducts 2227 non-null int64 13 MntGoldProds 2227 non-null int64 14 NumDealsPurchases 2227 non-null int64 15 NumWebPurchases 2227 non-null int64 16 NumCatalogPurchases 2227 non-null int64 17 NumStorePurchases 2227 non-null int64 18 NumWebVisitsMonth 2227 non-null int64 19 AcceptedCmp3 2227 non-null int64 20 AcceptedCmp4 2227 non-null int64 21 AcceptedCmp5 2227 non-null int64 22 AcceptedCmp1 2227 non-null int64 23 AcceptedCmp2 2227 non-null int64 24 Complain 2227 non-null int64 25 Response 2227 non-null int64 26 Age 2227 non-null int64 27 Kids 2227 non-null int64 28 Family_Size 2227 non-null int64 29 Expenses 2227 non-null int64 30 NumTotalPurchases 2227 non-null int64 31 day 2227 non-null datetime64[ns] 32 Engaged_in_days 2227 non-null int64 33 TotalAcceptedCmp 2227 non-null int64 34 AmountPerPurchase 2227 non-null float64 dtypes: datetime64[ns](2), float64(2), int64(30), object(1) memory usage: 626.3+ KB
The decision about which variables to use for clustering is a critically important decision that will have a big impact on the clustering solution. So we need to think carefully about the variables we will choose for clustering. Clearly, this is a step where a lot of contextual knowledge, creativity, and experimentation/iterations are needed.
Moreover, we often use only a few of the data attributes for segmentation (the segmentation attributes) and use some of the remaining ones (the profiling attributes) only to profile the clusters. For example, in market research and market segmentation, we can use behavioral data for segmentation (to segment the customers based on their behavior like amount spent, units bought, etc.), and then use both demographic as well as behavioral data for profiling the segments found.
Here, we will use the behavioral attributes for segmentation and drop the demographic attributes like Income, Age, and Family_Size. In addition to this, we need to drop some other columns which are mentioned below.
- Dt_Customer: we have created the Engaged_in_days variable from Dt_Customer, so it will not help with segmentation and can be dropped.
- day: we have created Engaged_in_days using the day variable, so it can be dropped as well.
- Complain: about 95% of the customers didn't complain, so nearly all rows share the same value; a near-constant variable will not have a major impact on segmentation.
- Marital_Status: this column was recoded just to build the Family_Size variable, which already contains its information.
- Education, Kids, Kidhome, and Teenhome: distance-based algorithms cannot use a default metric like Euclidean distance to compare categorical and numerical variables.
- AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, and Response: we have created the TotalAcceptedCmp variable, which aggregates the campaign-acceptance columns.

# Dropping all the irrelevant columns and storing the result in data_model
data_model = data.drop(
columns=[
"Year_Birth",
"Dt_Customer",
"day",
"Complain",
"Response",
"AcceptedCmp1",
"AcceptedCmp2",
"AcceptedCmp3",
"AcceptedCmp4",
"AcceptedCmp5",
"Marital_Status",
"Kids",
'Education',
'Kidhome',
'Teenhome', 'Income','Age', 'Family_Size'
],
axis=1,
)
data_model.shape
(2227, 17)
# Check first five rows of new data
data_model.head(5)
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 1617 | 25 | 997 | 0 | 64.680000 |
| 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 27 | 6 | 151 | 0 | 4.500000 |
| 2 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 776 | 21 | 498 | 0 | 36.952381 |
| 3 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 53 | 8 | 91 | 0 | 6.625000 |
| 4 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 422 | 19 | 347 | 0 | 22.210526 |
Let's plot the correlation heatmap after we've removed the irrelevant variables.
# Plot the correlation plot for new data
plt.figure(figsize=(15, 7))
sns.heatmap(data_model.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="rocket", linewidths=3)
plt.show()
Observations and insights: I chose a threshold of 0.60 as the minimum for considering a correlation between variables moderately positive. These are the correlations this heat map shows above that threshold:
1) Expenses with:
2) NumTotalPurchases with:
3) MntWines with:
4) MntMeatProducts with:
What is feature scaling?
Feature scaling is a class of statistical techniques that, as the name implies, scales the features of our data so that they all have a similar range. An example makes this clearer:
If you have multiple independent variables like Age, Income, and Amount related variables, with their range as (18–100 Years), (25K–75K), and (100–200), respectively, feature scaling would help them all to be in the same range.
Why is feature scaling important in unsupervised learning?
Feature scaling is especially relevant in machine learning models that compute some sort of distance metric as we do in most clustering algorithms, for example, K-Means.
So, scaling should be done to avoid one feature dominating over the others, because unsupervised learning algorithms use distance to find the similarity between data points.
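To make this concrete, here is a minimal pure-Python sketch (with hypothetical toy customers, not values from our dataset) of how an unscaled feature can dominate a Euclidean distance, and how standardization restores the balance:

```python
# Hypothetical customers as (Age, Income) pairs: without scaling, Income's
# range (tens of thousands) swamps Age's (tens) in the Euclidean distance.
def euclidean(p, q):
    return sum((u - v) ** 2 for u, v in zip(p, q)) ** 0.5

def standardize(column):
    mean = sum(column) / len(column)
    std = (sum((v - mean) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - mean) / std for v in column]

customers = [(20, 50_000), (60, 50_500), (21, 51_000)]

# Raw space: customer 0 looks closer to customer 1 (40 years older!) than to
# customer 2 (almost the same age) purely because of the Income differences.
raw_01 = euclidean(customers[0], customers[1])
raw_02 = euclidean(customers[0], customers[2])

# After standardizing each feature, Age contributes comparably and the
# ordering flips: customer 0 is now closer to the same-aged customer 2.
ages = standardize([c[0] for c in customers])
incomes = standardize([c[1] for c in customers])
scaled = list(zip(ages, incomes))
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_01 < raw_02, scaled_02 < scaled_01)  # True True
```

This is exactly the problem StandardScaler solves for us below.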
Let's scale the data
Standard Scaler: StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.
# Applying standard scaler on new data
scaler = StandardScaler()  # Initialize the Standard Scaler
data_scaled = scaler.fit_transform(data_model)  # fit_transform the scaler on the new data
data_scaled = pd.DataFrame(data_scaled, columns=data_model.columns)  # Converting the scaled array to a dataframe
data_scaled.head(10)
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.306906 | 0.979274 | 1.549793 | 1.735137 | 2.455586 | 1.471064 | 0.841828 | 0.357919 | 1.404892 | 2.633008 | -0.561330 | 0.696875 | 1.682349 | 1.329371 | 1.975678 | -0.440703 | 0.697428 |
| 1 | -0.384035 | -0.873925 | -0.638021 | -0.726901 | -0.652191 | -0.633425 | -0.732464 | -0.169952 | -1.119121 | -0.586255 | -1.178960 | -0.135935 | -0.963691 | -1.167064 | -1.667464 | -0.440703 | -0.638991 |
| 2 | -0.798600 | 0.358572 | 0.569049 | -0.175222 | 1.340442 | -0.149634 | -0.041311 | -0.697824 | 1.404892 | -0.228559 | 1.291559 | -0.552339 | 0.282777 | 0.803806 | -0.173173 | -0.440703 | 0.081680 |
| 3 | -0.798600 | -0.873925 | -0.562579 | -0.663070 | -0.505942 | -0.585045 | -0.751662 | -0.169952 | -0.758548 | -0.943951 | -0.561330 | 0.280470 | -0.920422 | -0.904281 | -1.925843 | -0.440703 | -0.591801 |
| 4 | 1.550599 | -0.392806 | 0.418165 | -0.216256 | 0.152175 | -0.004497 | -0.559676 | 1.413662 | 0.323172 | 0.129137 | 0.056299 | -0.135935 | -0.306341 | 0.541023 | -0.823427 | -0.440703 | -0.245693 |
| 5 | -1.144070 | 0.637739 | 0.393018 | -0.307442 | -0.688753 | 0.358346 | -0.578874 | -0.169952 | 0.683745 | 0.486833 | 1.291559 | 0.280470 | 0.182926 | 0.935197 | -0.254993 | -0.440703 | -0.016185 |
| 6 | -0.522224 | -0.208674 | 0.971406 | -0.006527 | 0.225299 | 0.527673 | -0.329291 | 0.885790 | 1.044319 | 0.129137 | 0.365114 | 0.280470 | -0.026760 | 0.803806 | 1.036902 | -0.440703 | -0.115011 |
| 7 | -0.591318 | -0.680883 | -0.411695 | -0.498934 | -0.633910 | -0.633425 | -0.406086 | -0.169952 | -0.037401 | -0.943951 | -0.561330 | 1.113279 | -0.727378 | -0.641499 | -0.104272 | -0.440703 | -0.363624 |
| 8 | -1.040429 | -0.865015 | -0.663168 | -0.644833 | -0.633910 | -0.585045 | -0.809258 | -0.697824 | -0.397975 | -0.943951 | -1.178960 | 1.529684 | -0.932072 | -1.167064 | 0.154107 | -0.440703 | -0.568669 |
| 9 | 0.652376 | -0.823437 | -0.663168 | -0.726901 | -0.670472 | -0.633425 | -0.598073 | -0.697824 | -1.119121 | -0.943951 | -1.796590 | 6.110134 | -0.927079 | -1.692629 | -1.051661 | 1.035154 | -0.194850 |
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed
tsne = TSNE(n_components=2, random_state=1, perplexity=35) # Initializing T-SNE with number of component equal to 2, random_state=1, and perplexity=35
data_customer_tsne = tsne.fit_transform(data_model) # fit_transform T-SNE on new data
data_customer_tsne = pd.DataFrame(data_customer_tsne, columns=[0, 1]) # Converting the embeddings to a dataframe
plt.figure(figsize=(7, 7)) # Scatter plot for two components
sns.scatterplot(x=0, y=1, data=data_customer_tsne) # Plotting T-SNE
<AxesSubplot:xlabel='0', ylabel='1'>
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed
tsne = TSNE(n_components=2, random_state=1, perplexity=15) # Initializing T-SNE with number of components equal to 2, random_state=1, and perplexity=15
data_customer_tsne = tsne.fit_transform(data_model) # fit_transform T-SNE on new data
data_customer_tsne = pd.DataFrame(data_customer_tsne, columns=[0, 1]) # Converting the embeddings to a dataframe
plt.figure(figsize=(7, 7)) # Scatter plot for two components
sns.scatterplot(x=0, y=1, data=data_customer_tsne) # Plotting T-SNE
<AxesSubplot:xlabel='0', ylabel='1'>
# Fitting T-SNE with number of components equal to 2 to visualize how data is distributed
tsne = TSNE(n_components=2, random_state=1, perplexity=50) # Initializing T-SNE with number of components equal to 2, random_state=1, and perplexity=50
data_customer_tsne = tsne.fit_transform(data_model) # fit_transform T-SNE on new data
data_customer_tsne = pd.DataFrame(data_customer_tsne, columns=[0, 1]) # Converting the embeddings to a dataframe
plt.figure(figsize=(7, 7)) # Scatter plot for two components
sns.scatterplot(x=0, y=1, data=data_customer_tsne) # Plotting T-SNE
<AxesSubplot:xlabel='0', ylabel='1'>
Observations and Insights: I tried three values for perplexity: 35, 15, and 50. From these scatterplots, it is not possible to observe any meaningful clusters that would give an idea of how the data is distributed. So far, it is not possible to identify underlying cluster patterns in the data.
When the variables used in clustering are highly correlated, it causes multicollinearity, which affects the clustering method and results in poor cluster profiling (or biased toward a few variables). PCA can be used to reduce the multicollinearity between the variables.
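As a minimal illustration of why PCA helps here (hypothetical toy features and a hand-rolled 2x2 eigendecomposition rather than scikit-learn's PCA), projecting two strongly correlated features onto the eigenvectors of their covariance matrix produces components with essentially zero correlation:

```python
import math

# Two hypothetical, strongly correlated features (think Expenses vs NumTotalPurchases)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def cov(u, v):
    return sum(p * q for p, q in zip(u, v)) / (len(u) - 1)

xc, yc = center(xs), center(ys)
a, b, c = cov(xc, xc), cov(xc, yc), cov(yc, yc)  # covariance matrix [[a, b], [b, c]]

# Closed-form eigendecomposition of the symmetric 2x2 covariance matrix
half_trace = (a + c) / 2
half_gap = math.hypot((a - c) / 2, b)
lam1, lam2 = half_trace + half_gap, half_trace - half_gap  # component variances

def unit(vec):
    norm = math.hypot(*vec)
    return (vec[0] / norm, vec[1] / norm)

v1, v2 = unit((b, lam1 - a)), unit((b, lam2 - a))  # principal axes

# Project the centered data onto the principal axes
pc1 = [u * v1[0] + w * v1[1] for u, w in zip(xc, yc)]
pc2 = [u * v2[0] + w * v2[1] for u, w in zip(xc, yc)]

# cov(pc1, pc2) is ~0: the components carry no shared (correlated) information
```

The distance-based clustering below therefore no longer double-counts the correlated variables.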
# Defining the number of principal components to generate
n = data_model.shape[1] # Storing the number of variables in the data
pca = PCA(n_components=n, random_state = 1) # Initialize PCA with n_components = n and random_state=1
data_pca = pd.DataFrame(pca.fit_transform(data_scaled)) # fit_transform PCA on the scaled data
# The percentage of variance explained by each principal component is stored
exp_var = pca.explained_variance_ratio_
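One common way to use these ratios is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. A small sketch with hypothetical ratios (the real values come from `pca.explained_variance_ratio_` above):

```python
# Hypothetical explained-variance ratios, one per principal component
exp_var_demo = [0.42, 0.21, 0.12, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01]

def n_components_for(ratios, threshold=0.90):
    # Return the smallest k whose cumulative explained variance >= threshold
    total = 0.0
    for k, ratio in enumerate(ratios, start=1):
        total += ratio
        if total >= threshold:
            return k
    return len(ratios)

print(n_components_for(exp_var_demo))  # 6 components reach 90% in this toy example
```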
Let's plot the first two components and see how the data points are distributed.
# Scatter plot for two components using the dataframe data_pca
plt.figure(figsize = (7, 7))
sns.scatterplot(x = data_pca[0], y = data_pca[1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
Let's apply clustering algorithms on the data generated after applying PCA
distortions = [] # Create an empty list
K = range(2, 10) # Setting the K range from 2 to 10
for k in K:
    kmeanModel = KMeans(n_clusters=k, random_state=4)  # Initialize K-Means
    kmeanModel.fit(data_pca)  # Fit K-Means on the data
    distortions.append(kmeanModel.inertia_)  # Append the distortion (inertia) value to the list
# Plotting the elbow plot
plt.figure(figsize=(16, 8)) # Setting the plot size
plt.plot(K, distortions, "bx-") # Plotting the K on X-axis and distortions on y-axis
plt.xlabel("k") # Title of x-axis
plt.ylabel("Distortion") # Title of y-axis
plt.title("The Elbow Method showing the optimal k") # Title of the plot
plt.show()
Observations: The curve shows noticeable drops before k = 3, 4, and 5, so these are the candidate elbow points.
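Eyeballing the elbow can be subjective. One simple heuristic (a sketch with hypothetical distortion values shaped like the curve above, not our actual inertia values) is to pick the k with the largest second difference, i.e. where the decrease in distortion slows down the most:

```python
# Hypothetical K range and distortion values for illustration only
ks_demo = list(range(2, 10))
distortions_demo = [1200, 800, 620, 540, 500, 480, 465, 455]

def elbow_k(ks, ds):
    # Second difference ds[i-1] - 2*ds[i] + ds[i+1]; a larger value means a sharper bend
    bends = [ds[i - 1] - 2 * ds[i] + ds[i + 1] for i in range(1, len(ds) - 1)]
    return ks[1 + bends.index(max(bends))]

print(elbow_k(ks_demo, distortions_demo))  # 3, the sharpest bend in this toy curve
```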
We can use the silhouette score as a metric for different K values to make a better decision about picking the number of clusters (K).
Silhouette score is one of the methods for evaluating the quality of clusters created using clustering algorithms such as K-Means. The silhouette score is a measure of how similar an object is to its cluster (cohesion) compared to other clusters (separation). Silhouette score has a range of [-1, 1].
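For a single point, the silhouette value is s = (b - a) / max(a, b), where a is the point's mean distance to its own cluster (cohesion) and b its mean distance to the nearest other cluster (separation). A toy illustration with hypothetical 1-D points:

```python
# Hypothetical 1-D clusters; we compute the silhouette of the point 1.0
own_cluster = [1.0, 1.2, 1.4]     # the point plus its cluster mates
other_cluster = [8.0, 8.5, 9.0]   # the nearest other cluster

point = own_cluster[0]
a = sum(abs(point - p) for p in own_cluster[1:]) / (len(own_cluster) - 1)  # cohesion
b = sum(abs(point - p) for p in other_cluster) / len(other_cluster)        # separation
s = (b - a) / max(a, b)

print(round(s, 2))  # 0.96: close to 1, so the point sits well inside its own cluster
```

`silhouette_score` below averages this value over every point in the dataset.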
Finding silhouette score for each value of K
sil_score = [] # Creating an empty list
cluster_list = range(3, 8) # Creating a range from 3 to 7
for n_clusters in cluster_list:
    # Initialize K-Means with number of clusters equal to n_clusters and random_state=1
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    # Fit and predict on the PCA data
    preds = clusterer.fit(data_pca).predict(data_pca)
    # Calculate the silhouette score using the silhouette_score() function
    score = silhouette_score(data_pca, preds)
    # Append the silhouette score to the list created above
    sil_score.append(score)
    # Print the silhouette score
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, the silhouette score is 0.2700522050350727
For n_clusters = 4, the silhouette score is 0.2576362652878985
For n_clusters = 5, the silhouette score is 0.22583613398899802
For n_clusters = 6, the silhouette score is 0.24314705614051696
For n_clusters = 7, the silhouette score is 0.12983168949386734
Observations and Insights:
From the above silhouette scores, 3 appears to be a good value of K. So, let's build K-Means using K=3.
Let's build K-Means using K=3 first
kmeans = KMeans(n_clusters= 3, random_state= 1) # Initialize the K-Means algorithm with 3 clusters and random_state=1
kmeans.fit(data_pca) # Fitting on the data_pca
KMeans(n_clusters=3, random_state=1)
data_pca["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
data["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_model["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to data_model
# Let's check the distribution
data_model["K_means_segments_3"].value_counts()
1    1046
0     613
2     568
Name: K_means_segments_3, dtype: int64
Visualizing these clusters as a histogram:
plt.figure(figsize=(15,7))
sns.histplot(x='K_means_segments_3', data=data_model)
plt.show()
Let's visualize these clusters using PCA
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_means_segments_3")
Observations and Insights:
# Taking the cluster-wise mean of all the variables. Hint: First groupby 'data' by 'K_means_segments_3' and then find mean
cluster_profile_KMeans_3 = data.groupby('K_means_segments_3').mean()
# Highlighting the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_3 | ||||||||||||||||||||||||||||||||
| 0 | 1965.972268 | 1.654160 | 57357.499184 | 0.288744 | 0.831974 | 47.712887 | 443.132137 | 22.879282 | 135.845024 | 30.053834 | 24.394780 | 61.404568 | 3.773246 | 6.464927 | 3.060359 | 7.843393 | 5.771615 | 0.073409 | 0.119086 | 0.019576 | 0.035889 | 0.014682 | 0.008157 | 0.133768 | 50.027732 | 1.120718 | 2.774878 | 717.709625 | 21.141925 | 593.830343 | 0.262643 | 33.698297 |
| 1 | 1971.021033 | 1.652008 | 35170.983748 | 0.756214 | 0.474187 | 49.165392 | 43.802103 | 4.975143 | 23.100382 | 7.272467 | 5.079350 | 14.972275 | 2.009560 | 2.113767 | 0.564054 | 3.268642 | 6.377629 | 0.066922 | 0.015296 | 0.000000 | 0.000956 | 0.001912 | 0.010516 | 0.085086 | 44.978967 | 1.230402 | 2.882409 | 99.201721 | 7.956023 | 499.522945 | 0.085086 | 11.196880 |
| 2 | 1968.109155 | 1.621479 | 75881.901408 | 0.038732 | 0.220070 | 50.540493 | 637.966549 | 69.542254 | 459.471831 | 101.890845 | 70.908451 | 79.267606 | 1.330986 | 5.220070 | 6.005282 | 8.325704 | 2.910211 | 0.084507 | 0.137324 | 0.264085 | 0.213028 | 0.031690 | 0.007042 | 0.286972 | 47.890845 | 0.258803 | 1.880282 | 1419.047535 | 20.882042 | 549.441901 | 0.730634 | 73.473251 |
Observations and Insights on each cluster:
Cluster 0 has 613 observations. It corresponds to a segment of customers with the highest means in Marital_Status, Teenhome, NumDealsPurchases, NumWebPurchases, Age, NumTotalPurchases, and Engaged_in_days.
Cluster 1 has 1046 observations. It corresponds to a segment of customers with the highest means in Year_Birth, Kidhome, NumWebVisitsMonth, Complain, Kids, and Family_Size.
Cluster 2 has 568 observations. It has the highest means in Income, Recency, all product columns, NumCatalogPurchases, NumStorePurchases, all AcceptedCmp columns, Response, Expenses, TotalAcceptedCmp, and AmountPerPurchase.
Let us create a boxplot for each of the variables
# Columns to use in boxplot
col_for_box = ['Income', 'Kidhome', 'Teenhome', 'Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age', 'Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
# Creating boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data["K_means_segments_3"], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Cluster 0:
Summary for cluster 0: These are medium-income customers living with a kid at home. They spend the most on wines and mainly purchase in store. They are the segment that purchases deals the most, and they also visit the website frequently. The catalog is the channel they use the least. They are the earliest clients, having been engaged with the store for a median of about 600 days. Their median age is 50.
Cluster 1 profile:
Summary for cluster 1: This cluster is made up of low-income customers who live with a kid at home. They are between 38 and 53 years old. They spend the least overall, but when they do buy, they purchase deals and mostly in store. They mainly purchase wine. They are the segment that visits the website the most. They haven't accepted any campaign.
Cluster 2 profile:
Summary for cluster 2: This cluster is made up of high-income customers with no children living at home. They have spent the largest amounts on all products during the last two years. Their main channel is in-store purchases, and they do not buy deals. They are between 49 and 59 years old. They have accepted campaigns, and they have been clients for more than a year.
# Dropping labels we got from K=3 since we will be using PCA data for prediction
# Drop K_means_segments_3. Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_3'], axis=1, inplace=True)
data.drop(['K_means_segments_3'], axis=1, inplace=True)
Let's build K-Means using K=4
# Fit the K-Means algorithm using number of cluster as 4 and random_state=1 on data_pca
kmeans = KMeans(n_clusters= 4, random_state= 1).fit(data_pca)
data_pca["K_means_segments_4"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
data["K_means_segments_4"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_model["K_means_segments_4"] = kmeans.labels_ # Add K-Means cluster labels to data_model
# Let's check the distribution
data_model["K_means_segments_4"].value_counts()
0    1003
2     558
1     468
3     198
Name: K_means_segments_4, dtype: int64
Visualizing these clusters as a histogram:
plt.figure(figsize=(15,7))
sns.histplot(x='K_means_segments_4', data=data_model)
plt.show()
Let's visualize the clusters using PCA
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_means_segments_4")
Observations and Insights:
Most data points follow a pattern similar to what we saw with K=3, which is consistent with their similar silhouette scores.
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_4 = data.groupby('K_means_segments_4').mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_4.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_4 | ||||||||||||||||||||||||||||||||
| 0 | 1971.286142 | 1.655035 | 34797.108674 | 0.758724 | 0.457627 | 49.201396 | 38.528415 | 4.860419 | 21.354935 | 7.089731 | 5.076770 | 14.475573 | 1.900299 | 2.012961 | 0.529412 | 3.212363 | 6.343968 | 0.063809 | 0.012961 | 0.000000 | 0.000997 | 0.001994 | 0.010967 | 0.082752 | 44.713858 | 1.216351 | 2.871386 | 91.385842 | 7.655035 | 493.780658 | 0.079761 | 10.846512 |
| 1 | 1967.713675 | 1.630342 | 71614.416667 | 0.057692 | 0.311966 | 50.085470 | 479.042735 | 72.032051 | 398.544872 | 102.068376 | 71.410256 | 81.948718 | 1.628205 | 5.282051 | 5.570513 | 8.589744 | 2.995726 | 0.025641 | 0.014957 | 0.044872 | 0.061966 | 0.000000 | 0.008547 | 0.141026 | 48.286325 | 0.369658 | 2.000000 | 1205.047009 | 21.070513 | 547.267094 | 0.147436 | 59.625060 |
| 2 | 1965.643369 | 1.648746 | 55119.998208 | 0.345878 | 0.876344 | 47.695341 | 415.107527 | 17.336918 | 116.621864 | 24.019713 | 17.978495 | 55.842294 | 4.077061 | 6.327957 | 2.729391 | 7.272401 | 6.130824 | 0.080645 | 0.116487 | 0.010753 | 0.026882 | 0.008961 | 0.007168 | 0.136201 | 50.356631 | 1.222222 | 2.870968 | 646.906810 | 20.406810 | 605.026882 | 0.243728 | 31.229751 |
| 3 | 1968.666667 | 1.616162 | 80181.363636 | 0.045455 | 0.186869 | 50.398990 | 936.141414 | 52.873737 | 481.833333 | 78.898990 | 60.601010 | 72.196970 | 1.151515 | 5.641414 | 6.141414 | 8.363636 | 3.414141 | 0.212121 | 0.414141 | 0.681818 | 0.500000 | 0.111111 | 0.005051 | 0.550505 | 47.333333 | 0.232323 | 1.848485 | 1682.545455 | 21.297980 | 553.606061 | 1.919192 | 90.363729 |
Observations and Insights on each cluster:
From this information, I do not recognize any clear pattern that differentiates one cluster from another. Let's see whether any relevant additional information comes up in the boxplots.
Let's plot the boxplot
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
# Creating boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data["K_means_segments_4"], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
The features where these clusters differ the most are:
Cluster 0 profile:
Summary for cluster 0: This cluster is made up of low-income customers; it includes the youngest clients (20 years old), with a median age of 44. They live with children or teenagers at home, spend low amounts in the store, have the lowest purchase counts and amounts spent, and have not accepted campaigns. They buy in store and are the segment that visits the website the most each month.
Cluster 1 profile:
Summary for cluster 1: This cluster is made up of medium-high income customers. They live with a teenager at home. They make the largest number of total purchases of all segments, buying wine above all, and they purchase mostly in store.
Cluster 2 profile:
Summary for cluster 2: This cluster is made up of medium-income customers. They live with a kid at home. They buy deals and make web purchases more than all other segments, buying wine above all. They have been engaged with the store longer than all other segments, with a median of about 600 days.
Cluster 3 profile:
Summary for cluster 3: They are the segment that has spent the most in the store, but also the smallest one, with only 198 observations. They are high-income customers who buy mostly wines followed by meat, from the catalog and in store. They don't buy deals, and they are the segment that has accepted campaigns the most. They have the largest total expenses, with a median of approximately 1,600.
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_4'], axis=1, inplace=True)
data.drop(['K_means_segments_4'], axis=1, inplace=True)
Let's build K-Means using K=5
# Fit the K-Means algorithm using number of cluster as 5 and random_state=0 on data_pca
kmeans = KMeans(n_clusters= 5, random_state= 0).fit(data_pca)
data_pca["K_means_segments_5"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
data["K_means_segments_5"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_model["K_means_segments_5"] = kmeans.labels_ # Add K-Means cluster labels to data_model
# Let's check the distribution
data_model["K_means_segments_5"].value_counts()
0    999
4    546
2    477
1    204
3      1
Name: K_means_segments_5, dtype: int64
Visualizing these clusters as a histogram:
plt.figure(figsize=(15,7))
sns.histplot(x='K_means_segments_5', data=data_model)
plt.show()
Let's visualize the clusters using PCA
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_means_segments_5")
Observations and Insights:
Most data points follow a pattern similar to what we saw with K=3, consistent with their similar silhouette scores, although the K=4 and K=5 solutions are not as well separated and defined as K=3.
# Take the cluster-wise mean of all the variables. Hint: First groupby 'data' by cluster labels column and then find mean
cluster_profile_KMeans_5 = data.groupby('K_means_segments_5').mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_5 | ||||||||||||||||||||||||||||||||
| 0 | 1971.260260 | 1.654655 | 34751.676677 | 0.758759 | 0.458458 | 49.336336 | 38.158158 | 4.805806 | 21.265265 | 6.977978 | 5.061061 | 14.208208 | 1.897898 | 2.003003 | 0.523524 | 3.210210 | 6.343343 | 0.063063 | 0.013013 | 0.000000 | 0.001001 | 0.002002 | 0.011011 | 0.082082 | 44.739740 | 1.217217 | 2.871872 | 90.476476 | 7.634635 | 493.024024 | 0.079079 | 10.789356 |
| 1 | 1968.965686 | 1.622549 | 80288.941176 | 0.044118 | 0.181373 | 49.803922 | 931.220588 | 55.745098 | 481.661765 | 81.960784 | 61.318627 | 74.784314 | 1.151961 | 5.627451 | 6.176471 | 8.406863 | 3.397059 | 0.205882 | 0.401961 | 0.681373 | 0.500000 | 0.107843 | 0.004902 | 0.544118 | 47.034314 | 0.225490 | 1.848039 | 1686.691176 | 21.362745 | 556.764706 | 1.897059 | 82.595043 |
| 2 | 1967.599581 | 1.628931 | 71320.025157 | 0.060797 | 0.331237 | 50.283019 | 477.234801 | 70.324948 | 389.989518 | 99.467505 | 70.660377 | 80.190776 | 1.693920 | 5.354298 | 5.547170 | 8.616352 | 3.058700 | 0.025157 | 0.012579 | 0.037736 | 0.056604 | 0.000000 | 0.008386 | 0.138365 | 48.400419 | 0.392034 | 2.020964 | 1187.867925 | 21.211740 | 545.127883 | 0.132075 | 58.486627 |
| 3 | 1978.000000 | 2.000000 | 51315.000000 | 0.000000 | 0.000000 | 53.000000 | 32.000000 | 2.000000 | 1607.000000 | 12.000000 | 4.000000 | 22.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 38.000000 | 0.000000 | 2.000000 | 1679.000000 | 1.000000 | 754.000000 | 1.000000 | 1679.000000 |
| 4 | 1965.630037 | 1.648352 | 54730.915751 | 0.355311 | 0.875458 | 47.430403 | 410.366300 | 16.500000 | 112.236264 | 23.360806 | 16.976190 | 56.049451 | 4.087912 | 6.293040 | 2.652015 | 7.184982 | 6.177656 | 0.084249 | 0.119048 | 0.009158 | 0.025641 | 0.009158 | 0.007326 | 0.137363 | 50.369963 | 1.230769 | 2.879121 | 635.489011 | 20.217949 | 607.529304 | 0.247253 | 30.946379 |
Observations and Insights on each cluster:
From this information, I do not recognize any clear pattern that differentiates one cluster from another either. In fact, the profiling qualities of each cluster don't make much sense; they don't seem to reveal anything useful. Moreover, the distribution of the clusters is odd and confusing: we have a cluster with a single observation, and two clusters that are almost completely overlapped. K=5 does not add anything valuable over the results we got from K=4, so I will keep K=4 as my chosen result from this model.
Let's plot the boxplot
# Columns to use in boxplot
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
# Creating boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data["K_means_segments_5"], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Dropping labels we got from K-Means since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['K_means_segments_5'], axis=1, inplace=True)
data.drop(['K_means_segments_5'], axis=1, inplace=True)
Let's find the silhouette score for K=5 in K-Medoids
kmedo = KMedoids(n_clusters= 5, random_state= 1) # Initializing K-Medoids with number of clusters as 5 and random_state=1
preds = kmedo.fit(data_pca).predict(data_pca) # Fit and predict K-Medoids using data_pca
score = silhouette_score(data_pca, preds) # Calculate the silhouette score
print(score) # Print the score
0.10383413504610778
Observations and Insights: The silhouette score for K=5 with K-Medoids is very low compared to the scores we got with K-Means.
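As a reminder of how K-Medoids differs from K-Means: each cluster center (the medoid) is an actual data point that minimizes the total distance to the rest of its cluster, which makes it less sensitive to outliers than a mean. A minimal 1-D sketch with hypothetical values:

```python
# Hypothetical 1-D cluster with one extreme customer value
cluster_demo = [2.0, 3.0, 4.0, 5.0, 100.0]

def medoid(points):
    # The point with the smallest total distance to all other points
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

cluster_mean = sum(cluster_demo) / len(cluster_demo)  # dragged toward the outlier
print(medoid(cluster_demo), cluster_mean)  # 4.0 22.8
```

The medoid stays at a real observation (4.0) while the K-Means-style mean (22.8) lands far from most of the cluster.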
# Predicting on data_pca and adding K-Medoids cluster labels to the whole data
data['K_medoids_segments'] = preds
# Predicting on data_pca and adding K-Medoids cluster labels to data_model
data_model['K_medoids_segments'] = preds
# Predicting on data_pca and adding K-Medoids cluster labels to data_pca
data_pca['K_medoids_segments'] = preds
# Let's check the distribution
data_model["K_medoids_segments"].value_counts()
3    603
1    587
0    467
2    302
4    268
Name: K_medoids_segments, dtype: int64
Visualizing these clusters as a histogram:
plt.figure(figsize=(15,7))
sns.histplot(x='K_medoids_segments', data=data_model)
plt.show()
Let's visualize the clusters using PCA
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "K_medoids_segments")
Observations and Insights:
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_KMedoids = data.groupby('K_medoids_segments').mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_KMedoids.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_medoids_segments | ||||||||||||||||||||||||||||||||
| 0 | 1968.661670 | 1.605996 | 76628.961456 | 0.042827 | 0.216274 | 50.927195 | 667.261242 | 75.179872 | 483.462527 | 108.940043 | 79.886510 | 83.111349 | 1.381156 | 5.509636 | 5.944325 | 8.605996 | 3.081370 | 0.092077 | 0.141328 | 0.299786 | 0.241970 | 0.034261 | 0.004283 | 0.323340 | 47.338330 | 0.259101 | 1.865096 | 1497.841542 | 21.441113 | 568.668094 | 0.809422 | 76.349673 |
| 1 | 1965.683135 | 1.654174 | 61775.618399 | 0.187394 | 0.724020 | 46.025554 | 494.132879 | 28.928450 | 186.342419 | 39.562181 | 28.495741 | 66.158433 | 3.189097 | 6.091993 | 3.945486 | 8.195911 | 4.875639 | 0.071550 | 0.120954 | 0.037479 | 0.040886 | 0.018739 | 0.010221 | 0.126065 | 50.316865 | 0.911414 | 2.565588 | 843.620102 | 21.422487 | 573.446337 | 0.289608 | 40.554090 |
| 2 | 1971.546358 | 1.625828 | 33470.874172 | 0.751656 | 0.493377 | 42.516556 | 76.447020 | 7.927152 | 37.817881 | 12.235099 | 7.725166 | 25.745033 | 3.109272 | 3.172185 | 0.867550 | 3.576159 | 7.516556 | 0.072848 | 0.013245 | 0.000000 | 0.000000 | 0.000000 | 0.019868 | 0.208609 | 44.453642 | 1.245033 | 2.870861 | 167.897351 | 10.725166 | 785.850993 | 0.086093 | 14.460202 |
| 3 | 1972.320066 | 1.643449 | 33215.016584 | 0.817579 | 0.419569 | 46.266998 | 22.242123 | 3.001658 | 13.457711 | 4.499171 | 3.084577 | 9.807629 | 1.548922 | 1.615257 | 0.270315 | 2.873964 | 6.452736 | 0.077944 | 0.006633 | 0.000000 | 0.000000 | 0.003317 | 0.009950 | 0.061360 | 43.679934 | 1.237148 | 2.880597 | 56.092869 | 6.308458 | 417.504146 | 0.087894 | 8.589409 |
| 4 | 1965.589552 | 1.716418 | 48003.895522 | 0.522388 | 0.757463 | 66.578358 | 155.432836 | 9.085821 | 51.194030 | 12.679104 | 8.641791 | 26.078358 | 2.914179 | 3.947761 | 1.343284 | 4.899254 | 5.223881 | 0.033582 | 0.082090 | 0.000000 | 0.026119 | 0.000000 | 0.000000 | 0.033582 | 50.410448 | 1.279851 | 2.996269 | 263.111940 | 13.104478 | 400.518657 | 0.141791 | 19.010658 |
Observations and Insights on each cluster:
From this information, I do not recognize any clear pattern that differentiates one cluster from another either. In fact, the profiling qualities of each cluster do not make much sense business-wise, so I would not regard these as useful clusters.
Let's plot the boxplot
# Columns to use in the boxplots
col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
# Create a boxplot for each of the variables, grouped by cluster
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['K_medoids_segments'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
The features where these clusters differ the most are:
Cluster 0 profile:
Summary for cluster 0: This cluster is made up of high-income customers with no kids or teenagers living at home. They are the main buyers of wine, they purchase from the catalog and in store, and they are the only segment that has accepted campaigns to a meaningful degree. They barely visit the website and do not buy deals. They have spent the largest amount of money on the store's products and have the highest median total number of purchases.
Cluster 1 profile:
Summary for cluster 1: This cluster is made up of medium-to-high-income customers with a teenager living at home. They are the segment with the second-highest wine spending. They make the most web purchases of all segments, although they buy mainly in store, then on the website, and then from the catalog. They have not accepted any campaign. Their median age is 50.
Cluster 2 profile:
Summary for cluster 2: This cluster is made up of low-income customers living with a kid and a teenager at home. They buy wine first and mainly purchase in store. They visit the website the most of all segments and also buy deals. They are the longest-engaged clients (highest Engaged_in_days) and have not accepted any campaign.
Cluster 3 profile:
Summary for cluster 3: This cluster is made up of low-income customers, typically living with a kid at home. They buy the least of every product, purchase few deals, and buy mainly in store. They are recent clients and have not accepted any campaign.
Cluster 4 profile:
Summary for cluster 4: This cluster is made up of medium-income customers, typically living with a teenager and sometimes a kid at home. They mainly buy wine, followed by meat, gold and fish products. They purchase mainly in store, followed by the website. They have not accepted any campaign and are the most recent clients.
# Dropping labels we got from K-Medoids since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['K_medoids_segments'], axis=1, inplace=True)
data.drop(['K_medoids_segments'], axis=1, inplace=True)
Let's find the Cophenetic correlation for different distances with different linkage methods.
The cophenetic correlation coefficient is the correlation between the cophenetic distances (the dendrogram distances) obtained from the tree and the original pairwise distances used to construct the tree. It measures how faithfully a dendrogram preserves the pairwise distances between the original data points.
The cophenetic distance between two observations is represented in a dendrogram by the height of the link at which those two observations are first joined; that height is the distance between the two subclusters merged by that link.
Cophenetic correlation is also a way to compare two or more dendrograms.
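As a minimal sketch of this calculation (on a small synthetic array rather than our data_pca, so the numbers are illustrative only), SciPy's cophenet() returns both the coefficient and the cophenetic distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Small synthetic dataset as a stand-in for data_pca
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

# Build the tree with average linkage and Euclidean distance
Z = linkage(X, metric="euclidean", method="average")

# Correlate the cophenetic (dendrogram) distances with the original pairwise distances
c, coph_dists = cophenet(Z, pdist(X))
print(round(c, 3))  # closer to 1 means the dendrogram preserves the original distances better
```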
Let's calculate Cophenetic correlation for each of the distance metrics with each of the linkage methods
# List of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
high_cophenet_corr = 0  # Highest cophenetic correlation seen so far
high_dm_lm = [0, 0]  # Distance metric and linkage method that produced it
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(data_pca, metric=dm, method=lm)  # Build the tree for this metric/linkage pair
        c, coph_dists = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for this tree
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:  # Keep track of the best combination
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.8058309411978958.
Cophenetic correlation for Euclidean distance and complete linkage is 0.6646459995708378.
Cophenetic correlation for Euclidean distance and average linkage is 0.8649717378300196.
Cophenetic correlation for Chebyshev distance and single linkage is 0.7947453694045187.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6608218045726901.
Cophenetic correlation for Chebyshev distance and average linkage is 0.7834968281406554.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.6280292997438838.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.5789238824021619.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.6801758232330247.
Cophenetic correlation for Cityblock distance and single linkage is 0.8000297772063457.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7205051270317547.
Cophenetic correlation for Cityblock distance and average linkage is 0.8602279087813257.
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8649717378300196, which is obtained with Euclidean distance and average linkage.
Let's have a look at the dendrograms for different linkages with Cityblock distance
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))  # Setting the plot size
# For each linkage method, plot the dendrogram and annotate it with its cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="cityblock", method=method)  # Build the tree with Cityblock distance
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Cityblock distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Let's have a look at the dendrograms for different linkages with Chebyshev distance
# Plot the dendrogram for Chebyshev distance with linkages single, complete and average.
# Hint: Use Chebyshev distance as the metric in the linkage() function
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))  # Setting the plot size
# For each linkage method, plot the dendrogram and annotate it with its cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="chebyshev", method=method)  # Build the tree with Chebyshev distance
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Chebyshev distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Let's have a look at the dendrograms for different linkages with Mahalanobis distance
# Plot the dendrogram for Mahalanobis distance with linkages single, complete and average.
# Hint: Use Mahalanobis distance as the metric in the linkage() function
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))  # Setting the plot size
# For each linkage method, plot the dendrogram and annotate it with its cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="mahalanobis", method=method)  # Build the tree with Mahalanobis distance
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Mahalanobis distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Let's have a look at the dendrograms for different linkages with Euclidean distance
# Plot the dendrogram for Euclidean distance with linkages single, complete and average.
# Hint: Use Euclidean distance as the metric in the linkage() function
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))  # Setting the plot size
# For each linkage method, plot the dendrogram and annotate it with its cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(data_pca, metric="euclidean", method=method)  # Build the tree with Euclidean distance
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")  # Title of dendrogram
    coph_corr, coph_dist = cophenet(Z, pdist(data_pca))  # Cophenetic correlation for Euclidean distance
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
Observations and Insights about all dendrograms: I do not see that these dendrograms reveal anything more helpful than the models tried so far. Many of them are cluttered and difficult to read; in most cases they suggest too many clusters, each containing very few observations.
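One way to make crowded dendrograms readable is to show only the last merges; a sketch on synthetic data (not our data_pca), using scipy's standard truncate_mode and p parameters of dendrogram():

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Synthetic stand-in for data_pca
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z = linkage(X, metric="euclidean", method="average")

# Show only the last 12 merges; contracted branches are drawn compactly
ddata = dendrogram(Z, truncate_mode="lastp", p=12, show_contracted=True)
plt.title("Truncated dendrogram (last 12 merges)")
```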
data_pca.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2227 entries, 0 to 2226
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       2227 non-null   float64
 1   1       2227 non-null   float64
 2   2       2227 non-null   float64
 3   3       2227 non-null   float64
 4   4       2227 non-null   float64
 5   5       2227 non-null   float64
 6   6       2227 non-null   float64
 7   7       2227 non-null   float64
 8   8       2227 non-null   float64
 9   9       2227 non-null   float64
 10  10      2227 non-null   float64
 11  11      2227 non-null   float64
 12  12      2227 non-null   float64
 13  13      2227 non-null   float64
 14  14      2227 non-null   float64
 15  15      2227 non-null   float64
 16  16      2227 non-null   float64
dtypes: float64(17)
memory usage: 295.9 KB
# Initialize Agglomerative Clustering with affinity (distance) as Euclidean, linkage as 'Ward' with clusters=3
HCmodel = AgglomerativeClustering(n_clusters= 3, affinity = 'euclidean', linkage='ward').fit_predict(data_pca)
labels = HCmodel
# Calculate silhouette score - Hint: Use silhouette_score() function
score = silhouette_score(data_pca, labels)
print(score)
0.23452261634656826
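We fixed n_clusters=3 above; an alternative worth sketching is to sweep several cluster counts and compare silhouette scores. This is only an illustration on synthetic blobs, not on data_pca:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 4)),
               rng.normal(5, 1, size=(100, 4))])

# Sweep candidate cluster counts and keep the silhouette score for each
scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)  # for these two blobs the silhouette prefers 2 clusters
```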
# Add Agglomerative Clustering cluster labels to data_pca
data_pca["HC_segments"] = labels
# Add Agglomerative Clustering cluster labels to the whole data
data["HC_segments"] = labels
# Add Agglomerative Clustering cluster labels to data_model
data_model["HC_segments"] = labels
# Let's check the distribution
data_model['HC_segments'].value_counts()
1    909
2    807
0    511
Name: HC_segments, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
# Scatter plot of two principal components, colored by cluster label
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "HC_segments")
Observations and Insights on the distribution of observations: these clusters are not well separated; they lack density and overlap considerably. Although they are well balanced in the number of observations per cluster, the fact that they do not differentiate well from each other limits their usefulness.
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_HC = data.groupby('HC_segments').mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_HC.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||||||||||||||||||||||
| 0 | 1968.450098 | 1.612524 | 76117.052838 | 0.033268 | 0.203523 | 51.561644 | 639.727984 | 70.739726 | 463.763209 | 101.354207 | 71.575342 | 77.469667 | 1.289628 | 5.111546 | 5.722114 | 8.082192 | 2.827789 | 0.086106 | 0.154599 | 0.299413 | 0.230920 | 0.039139 | 0.005871 | 0.299413 | 47.549902 | 0.236791 | 1.849315 | 1424.630137 | 20.205479 | 544.005871 | 0.810176 | 75.929317 |
| 1 | 1971.705171 | 1.646865 | 33512.654565 | 0.809681 | 0.449945 | 48.772277 | 30.623762 | 3.976898 | 18.013201 | 5.668867 | 4.207921 | 12.829483 | 1.869087 | 1.896590 | 0.415842 | 3.033003 | 6.500550 | 0.061606 | 0.005501 | 0.000000 | 0.000000 | 0.002200 | 0.012101 | 0.085809 | 44.294829 | 1.259626 | 2.906491 | 75.320132 | 7.214521 | 502.947195 | 0.069307 | 9.830857 |
| 2 | 1965.993804 | 1.662949 | 56618.462206 | 0.293680 | 0.765799 | 47.955390 | 402.830235 | 23.501859 | 142.576208 | 33.406444 | 24.960347 | 58.335812 | 3.485750 | 5.951673 | 3.190830 | 7.520446 | 5.586121 | 0.078067 | 0.102850 | 0.011152 | 0.032218 | 0.008674 | 0.007435 | 0.127633 | 50.006196 | 1.059480 | 2.722429 | 685.610905 | 20.148699 | 574.270136 | 0.232962 | 32.671213 |
Observations and Insights on each cluster:
Again, as with the last model, I do not recognize from this information any clear pattern that differentiates one cluster from another. The profiling qualities of each cluster do not make much sense business-wise either.
Let's plot the boxplot
# Create a boxplot for each of the variables, grouped by cluster
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['HC_segments'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Cluster 0:
Summary for cluster 0: This cluster is made up of high-income customers with no kid or teenager at home. They buy wine the most and do not buy deals. They buy mainly in store, followed by catalog and web purchases. Their median age is 47. They are the only segment that has meaningfully accepted campaigns, and they have spent the largest amount per purchase of all segments.
Cluster 1:
Summary for cluster 1: This cluster is made up of low-income customers, typically living with a kid at home. They are the segment that spends the least and makes the smallest total number of purchases and amount per purchase. They are the most recent clients, engaged with the store for a median of around 500 days. They have not accepted any campaign.
Cluster 2:
Summary for cluster 2: This cluster is made up of medium-income customers, typically living with a teenager at home. They mainly purchase wine, buy in store, and visit the website frequently. They make among the largest numbers of total purchases and are the longest-engaged customers, with a median of almost 600 days. They have mostly not accepted any campaign.
# Dropping labels we got from Agglomerative Clustering since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['HC_segments'], axis=1, inplace=True)
data.drop(['HC_segments'], axis=1, inplace=True)
DBSCAN is a very powerful algorithm for finding high-density clusters, but the main difficulty is choosing its two hyperparameters, eps and min_samples.
Since it is an unsupervised algorithm, there is no validation set to tune against, unlike in supervised learning. The approach we can follow is to try a grid of value combinations and compute the silhouette score for each.
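A common complement to this grid search (not used above, so treat it as an optional heuristic) is the k-distance graph: sort every point's distance to its k-th nearest neighbor and look for the elbow, which suggests a reasonable eps for min_samples = k. A sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for data_pca
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))

k = 20  # candidate min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)  # last column: distance to the k-th nearest neighbor (the point itself counts as the first)
k_dist = np.sort(distances[:, -1])

# In a notebook you would plot k_dist and read eps off the elbow;
# a rough stand-in is a high percentile of the curve
eps_candidate = float(np.quantile(k_dist, 0.90))
print(eps_candidate)
```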
# Initializing lists
eps_value = [2, 3]  # Candidate eps values
min_sample_values = [6, 20]  # Candidate min_samples values
# Creating a dictionary mapping each eps value to the list of min_samples values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0  # Highest silhouette score seen so far
high_i_j = [0, 0]  # eps and min_samples values that produced it
key = res.keys()
for i in key:
    z = res[i]
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)  # Fit DBSCAN for this combination
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)  # Silhouette score for this combination
        print(
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,
        )
        if high_silhouette_avg < silhouette_avg:  # Keep track of the best combination
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =2 For min sample =6 The average silhouette_score is : 0.19623335001342507
For eps value =2 For min sample =20 The average silhouette_score is : 0.3381054125658042
For eps value =3 For min sample =6 The average silhouette_score is : 0.20036243530070832
For eps value =3 For min sample =20 The average silhouette_score is : 0.341052707104241
# Printing the highest silhouette score
print("Highest silhouette_avg is {} for eps = {} and min sample = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest silhouette_avg is 0.341052707104241 for eps = 3 and min sample = 20
# Initializing lists
eps_value = [1, 2]  # Candidate eps values
min_sample_values = [10, 33]  # Candidate min_samples values
# Creating a dictionary mapping each eps value to the list of min_samples values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0  # Highest silhouette score seen so far
high_i_j = [0, 0]  # eps and min_samples values that produced it
key = res.keys()
for i in key:
    z = res[i]
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)  # Fit DBSCAN for this combination
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)  # Silhouette score for this combination
        print(
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,
        )
        if high_silhouette_avg < silhouette_avg:  # Keep track of the best combination
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =1 For min sample =10 The average silhouette_score is : 0.025643772778777187
For eps value =1 For min sample =33 The average silhouette_score is : 0.1031085655564581
For eps value =2 For min sample =10 The average silhouette_score is : 0.33983556606490745
For eps value =2 For min sample =33 The average silhouette_score is : 0.33536290923972917
# Initializing lists
eps_value = [1, 2]  # Candidate eps values
min_sample_values = [44, 70]  # Candidate min_samples values
# Creating a dictionary mapping each eps value to the list of min_samples values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0  # Highest silhouette score seen so far
high_i_j = [0, 0]  # eps and min_samples values that produced it
key = res.keys()
for i in key:
    z = res[i]
    for j in z:
        db = DBSCAN(eps=i, min_samples=j).fit(data_pca)  # Fit DBSCAN for this combination
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        silhouette_avg = silhouette_score(data_pca, labels)  # Silhouette score for this combination
        print(
            "For eps value =" + str(i),
            "For min sample =" + str(j),
            "The average silhouette_score is :",
            silhouette_avg,
        )
        if high_silhouette_avg < silhouette_avg:  # Keep track of the best combination
            high_silhouette_avg = silhouette_avg
            high_i_j[0] = i
            high_i_j[1] = j
For eps value =1 For min sample =44 The average silhouette_score is : 0.05684806462872944
For eps value =1 For min sample =70 The average silhouette_score is : -0.09151027357324977
For eps value =2 For min sample =44 The average silhouette_score is : 0.3341822157081539
For eps value =2 For min sample =70 The average silhouette_score is : 0.3272339012067853
Now, let's apply DBSCAN using the hyperparameter values we have received above.
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps=3, min_samples=20)
# fit_predict once on data_pca and reuse the labels everywhere
dbs_labels = dbs.fit_predict(data_pca)
# Add DBSCAN cluster labels to the whole data
data['DBS_segments'] = dbs_labels
# Add DBSCAN cluster labels to data_model
data_model['DBS_segments'] = dbs_labels
# Add DBSCAN cluster labels to data_pca
data_pca['DBS_segments'] = dbs_labels
# Let's check the distribution
data_pca["DBS_segments"].value_counts()
 0    1912
-1     315
Name: DBS_segments, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
# Scatter plot of two principal components, colored by cluster label
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "DBS_segments")
Observations and Insights: The scatter plot shows two clusters, but they are not well separated, and the density of cluster -1 is very low. This solution does not yield good insights. Let's try the next pair of eps and min_samples values tried before that got a reasonable silhouette score.
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps=2, min_samples=10)
# fit_predict once on data_pca and reuse the labels everywhere
dbs_labels = dbs.fit_predict(data_pca)
# Add DBSCAN cluster labels to the whole data
data['DBS_segments'] = dbs_labels
# Add DBSCAN cluster labels to data_model
data_model['DBS_segments'] = dbs_labels
# Add DBSCAN cluster labels to data_pca
data_pca['DBS_segments'] = dbs_labels
# Let's check the distribution
data_pca["DBS_segments"].value_counts()
 0    1344
-1     883
Name: DBS_segments, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
# Scatter plot of two principal components, colored by cluster label
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "DBS_segments")
Observations and Insights: Here the two clusters are much better defined, although they still overlap a little. The observations are distributed in a more balanced way than before. Let's try the third pair of eps and min_samples values.
# Apply DBSCAN using the above hyperparameter values
dbs = DBSCAN(eps=2, min_samples=44)
# fit_predict once on data_pca and reuse the labels everywhere
dbs_labels = dbs.fit_predict(data_pca)
# Add DBSCAN cluster labels to the whole data
data['DBS_segments'] = dbs_labels
# Add DBSCAN cluster labels to data_model
data_model['DBS_segments'] = dbs_labels
# Add DBSCAN cluster labels to data_pca
data_pca['DBS_segments'] = dbs_labels
# Let's check the distribution
data_pca["DBS_segments"].value_counts()
 0    1225
-1    1002
Name: DBS_segments, dtype: int64
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
# Scatter plot of two principal components, colored by cluster label
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "DBS_segments")
Observations and Insights: Of these three solutions, I like the last one best. It seems to have slightly less overlap than the previous solution, but a very similar pattern. I will profile the clusters of this solution to find out more about them.
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_DBSCAN = data.groupby('DBS_segments').mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_DBSCAN.style.highlight_max(color="lightgreen", axis=0)
| Year_Birth | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DBS_segments | ||||||||||||||||||||||||||||||||
| -1 | 1967.270459 | 1.630739 | 68576.822355 | 0.126747 | 0.467066 | 49.382236 | 582.637725 | 51.186627 | 326.831337 | 73.197605 | 52.981038 | 75.341317 | 2.354291 | 5.852295 | 4.928144 | 8.298403 | 4.175649 | 0.092814 | 0.146707 | 0.161677 | 0.141717 | 0.026946 | 0.007984 | 0.225549 | 48.729541 | 0.593812 | 2.224551 | 1162.175649 | 21.433134 | 568.909182 | 0.569860 | 58.189258 |
| 0 | 1970.212245 | 1.656327 | 37825.257143 | 0.704490 | 0.541224 | 48.898776 | 78.382857 | 6.073469 | 33.413061 | 8.620408 | 6.086531 | 18.640000 | 2.295510 | 2.673469 | 0.766531 | 3.788571 | 6.267755 | 0.057143 | 0.016327 | 0.000000 | 0.001633 | 0.001633 | 0.009796 | 0.088163 | 45.787755 | 1.245714 | 2.902041 | 151.216327 | 9.524082 | 513.106122 | 0.076735 | 12.894826 |
Observations and Insights on each cluster:
Cluster -1 has 1002 observations. It corresponds to a segment of customers that have the highest mean in columns Income, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, all AcceptedCmp columns, Response, Age, Expenses, NumTotalPurchases, Engaged_in_days, TotalAcceptedCmp, AmountPerPurchase.
Cluster 0 has 1225 observations. It corresponds to a segment of customers that have the highest mean in columns Year_Birth, Marital_Status, Kidhome, Teenhome, NumWebVisitsMonth, Complain, Kids, Family_Size.
Let's plot the boxplot
# Create a boxplot for each of the variables, grouped by cluster
all_col = col_for_box
plt.figure(figsize=(30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['DBS_segments'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
The features where these clusters differ the most are:
Cluster -1:
Summary for cluster -1: This cluster is made up of high-income customers, usually with no kid and sometimes a teenager at home. They buy more of every product than the other cluster and purchase mainly in store, followed by web purchases. They are the older segment, with a median age around 48. They have accepted campaigns more often and make the largest amount per purchase.
Cluster 0:
Summary for cluster 0: This cluster is made up of low-income customers, often with a kid and sometimes a teenager at home. They buy less of every product than the other cluster, purchasing mainly wine, followed by meat. They buy in store, followed by web purchases, and visit the website often. They spend much less than the other segment and make fewer total purchases, with a median of about 8 (versus about 21 in the other segment). Their median age is around 44. They have mostly not accepted any campaign.
Insights: Although from the scatter plot this solution seemed worth exploring, it does not define the clusters any better than the models applied before.
# Dropping labels we got from DBSCAN since we will be using PCA data for prediction
# Hint: Use axis=1 and inplace=True
data_pca.drop(['DBS_segments'], axis=1, inplace=True)
data.drop(['DBS_segments'], axis=1, inplace=True)
Let's find the silhouette score for K=5 in Gaussian Mixture
gmm = GaussianMixture(n_components = 5, random_state = 1) # Initialize Gaussian Mixture Model with number of clusters as 5 and random_state=1
preds = gmm.fit_predict(data_pca) # Fit and predict Gaussian Mixture Model using data_pca
score = silhouette_score(data_pca, preds) # Calculate the silhouette score
print(score) # Print the score
0.09112454908758941
Observations and Insights: The silhouette score is very low. In fact, it is the lowest we have obtained so far in this project.
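Rather than fixing K = 5, one standard way to pick the number of Gaussian components is the BIC (lower is better). A sketch on synthetic blobs, not on our data_pca:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(150, 3)),
               rng.normal(6, 1, size=(150, 3))])

# Fit a GMM for each candidate component count and keep its BIC
bics = {k: GaussianMixture(n_components=k, random_state=1).fit(X).bic(X)
        for k in range(1, 7)}
best_k = min(bics, key=bics.get)  # BIC trades fit quality against model complexity
print(best_k)
```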
# Add the Gaussian Mixture Model cluster labels (predicted on data_pca) to the whole data
data['GMM_segments'] = preds
# Add the Gaussian Mixture Model cluster labels to data_model
data_model['GMM_segments'] = preds
# Add the Gaussian Mixture Model cluster labels to data_pca
data_pca['GMM_segments'] = preds
# Let's check the distribution
data_model['GMM_segments'].value_counts()
1    692
4    579
2    468
0    454
3     34
Name: GMM_segments, dtype: int64
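Expressing the distribution above as shares makes the tiny cluster 3 stand out. A small sketch that rebuilds the counts from the printed output (the labels here are reconstructed for illustration, not the project's actual series):

```python
import pandas as pd

# Reconstruct a label series matching the printed value_counts output
labels = pd.Series([1] * 692 + [4] * 579 + [2] * 468 + [0] * 454 + [3] * 34,
                   name="GMM_segments")

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts.to_dict())
print("Share of cluster 3:", round(shares[3], 4))  # under 2% of customers
```

A cluster holding under 2% of the customer base is usually too small to target with its own campaign.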
Let's visualize the clusters using PCA.
# Hint: Use PCA_PLOT function created above
def PCA_PLOT(X, Y, PCA, cluster):
    sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster)
PCA_PLOT(0, 1, data_pca, "GMM_segments")
Observations and Insights: This scatter plot shows very messy, heavily overlapping clusters. It is not possible to observe any pattern in the data that could be helpful for this business problem.
# Take the cluster-wise mean of all the variables. Hint: First group 'data' by cluster labels column and then find mean
cluster_profile_GMM = data.groupby('GMM_segments').mean()
# Highlight the maximum average value among all the clusters for each of the variables
cluster_profile_GMM.style.highlight_max(color="lightgreen", axis=0)
| | Year_Birth | Marital_Status | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GMM_segments | ||||||||||||||||||||||||||||||||
| 0 | 1966.048458 | 1.669604 | 59928.936123 | 0.204846 | 0.777533 | 48.715859 | 416.317181 | 38.557269 | 161.676211 | 51.149780 | 38.832599 | 65.808370 | 3.314978 | 6.676211 | 3.814978 | 9.059471 | 5.079295 | 0.066079 | 0.081498 | 0.008811 | 0.011013 | 0.004405 | 0.011013 | 0.085903 | 49.951542 | 0.982379 | 2.651982 | 772.341410 | 22.865639 | 562.374449 | 0.171806 | 33.257678 |
| 1 | 1972.228324 | 1.650289 | 31642.988439 | 0.868497 | 0.436416 | 48.234104 | 17.771676 | 2.763006 | 11.744220 | 4.401734 | 2.927746 | 9.026012 | 1.736994 | 1.563584 | 0.258671 | 2.864162 | 6.390173 | 0.075145 | 0.002890 | 0.000000 | 0.000000 | 0.002890 | 0.014451 | 0.069364 | 43.771676 | 1.304913 | 2.955202 | 48.634393 | 6.423410 | 477.748555 | 0.080925 | 7.425567 |
| 2 | 1968.023504 | 1.660256 | 44576.096154 | 0.549145 | 0.655983 | 49.549145 | 139.200855 | 8.504274 | 52.175214 | 11.405983 | 8.209402 | 32.623932 | 3.049145 | 3.952991 | 1.305556 | 4.510684 | 6.297009 | 0.055556 | 0.049145 | 0.000000 | 0.014957 | 0.000000 | 0.006410 | 0.138889 | 47.976496 | 1.205128 | 2.865385 | 252.119658 | 12.818376 | 560.602564 | 0.119658 | 18.997444 |
| 3 | 1972.176471 | 1.647059 | 66295.176471 | 0.088235 | 0.235294 | 49.500000 | 624.676471 | 52.911765 | 341.764706 | 55.588235 | 56.529412 | 78.470588 | 0.882353 | 3.705882 | 4.058824 | 4.764706 | 5.558824 | 0.235294 | 0.382353 | 0.588235 | 0.470588 | 0.205882 | 0.000000 | 0.500000 | 43.823529 | 0.323529 | 1.970588 | 1209.941176 | 13.411765 | 600.176471 | 1.882353 | 138.326402 |
| 4 | 1967.630397 | 1.606218 | 73971.644214 | 0.062176 | 0.278066 | 50.112263 | 677.255613 | 57.915371 | 433.246978 | 87.060449 | 60.661485 | 76.454231 | 1.739206 | 5.267703 | 5.556131 | 7.924007 | 3.450777 | 0.081174 | 0.158895 | 0.238342 | 0.200345 | 0.031088 | 0.003454 | 0.284974 | 48.369603 | 0.340242 | 1.946459 | 1392.594128 | 20.487047 | 569.799655 | 0.709845 | 69.551674 |
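The cluster-wise mean step above can be sketched on a toy frame; the column values below are made up, and only the groupby/mean pattern and the idea of locating each variable's maximum across clusters match the notebook:

```python
import pandas as pd

# Toy stand-in for 'data' with a cluster label column and two profile variables
toy = pd.DataFrame({
    "GMM_segments": [0, 0, 1, 1],
    "Income": [30000, 34000, 70000, 74000],
    "Expenses": [50, 46, 1300, 1500],
})

# Group by cluster label, then average each variable within the cluster
profile = toy.groupby("GMM_segments").mean()

# idxmax reports, per variable, which cluster holds the maximum average --
# the same information highlight_max marks in green in the styled table
top = profile.idxmax()
print(profile)
print(top.to_dict())
```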
Observations and Insights on each cluster:
Let's plot the boxplot
# Create boxplot for each of the variables
all_col = col_for_box
plt.figure(figsize = (30, 50))
for i, variable in enumerate(all_col):
    plt.subplot(6, 4, i + 1)
    sns.boxplot(y=data[variable], x=data['GMM_segments'], showmeans=True)
    plt.tight_layout()
    plt.title(variable)
plt.show()
The features where these clusters differ the most are:
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
Additional insights from these clusters: These cluster profiles differ in many ways from the clusters produced by the previous models. In my view, these clusters do not help in getting a clearer picture of the differences between customer segments.
Executive summary The marketing team approached our data science team for help because their campaigns have been unsuccessful. The goal was to find the best possible customer segmentation using the customer base they provided, which originally contained 2,240 entries and 27 columns. After exploring and engineering the data and applying the models, several interesting key takeaways were found; they are conveyed in the proposed solution as follows.
Proposed solution After applying different models, four customer segments were found. This solution was chosen because it is the only one that separates customers based on their purchase behavior (the type of products they buy and the total amount they spend) and divides them into four well-defined clusters, both in the number of observations each one holds and in the quality of their profiles.
These four clusters are: Cluster 0:
Reasons for choosing this segmentation solution:
How did I get to this solution?
Model chosen: I chose K-Means with K=4 because it was the only algorithm that produced a reasonable number of clusters with a reasonable number of observations in each, and the only solution in which two different groups of products were allocated to different clusters. I consider this crucial because it captures an essential part of the customers' purchase behavior; all the other solutions allocated every product to a single cluster, losing this edge. The exception was K-Means with K=5, but I did not choose it because its cluster distribution was less useful than K=4 in both profiling and practical terms.
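As a hedged sketch of the chosen final model (K-Means with K=4), here is the fitting pattern on synthetic data standing in for the engineered customer features; the data and every parameter other than K are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled/engineered customer features
X, _ = make_blobs(n_samples=400, centers=4, random_state=1)

# Final model: K-Means with K=4, multiple initializations for stability
km = KMeans(n_clusters=4, n_init=10, random_state=1)
segments = km.fit_predict(X)

print(np.bincount(segments))  # observations per segment
```

Once fitted, the `segments` array plays the same role as the `*_segments` columns earlier in the notebook: it can be attached to the original frame and profiled with the groupby-mean and boxplot patterns shown above.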
Recommendation for implementation: